ABSTRACT

A population data set was randomly generated, from which a random
sample was drawn. This sample was randomly divided into two data
sets, one of which was used to generate parameter estimates, which
were then used in the second data set for cross-validation purposes.
The best variable subset models were compared between the two data
sets on the R-squared and the Mallows C(p) criteria for best model
selection. The cross-validation method postulated a correlated
predictor set. The parameter estimates, standard errors, and t values
of the best variable subset models were then compared between the
multiple regression approach with correlated predictors and the
principal components method that creates orthogonal predictor
variables. The Mallows C(p) values were inflated and did not always
indicate the best variable subset model upon cross validation. The
R-squared values are the same regardless of correlated or orthogonal
predictors; therefore, parameter estimates and standard errors in a
principal components analysis should be investigated. This is
especially the case in the presence of multicollinearity in the best
variable subset model predictor set. The use of PROC IML procedures
for cross validation is discussed. Ten tables and one figure
illustrate the discussion. An appendix presents analysis programs.
(Contains 23 references.) (Author/SLD)

ABSTRACT

A cross validation comparison of the Mallows Cp subset model
selection criteria using randomly generated data sets indicated
that different subset models may be identified. The principal
component regression method using Type II sum of squares with
orthogonal principal component variables indicated a slightly
different set of "best" variables. The two methods in the presence
of multicollinearity can yield different subset models. It is
recommended that researchers base regression models on substantive
theory, model validation, and effect sizes for proper model testing
and interpretation.



Note: SPSS-X has program commands which permit the cross validation
of results in multiple regression; however, SAS does not. Moreover,
SAS outputs the Mallows Cp statistic, but SPSS-X regression
procedures do not. These factors have prompted the use of PROC IML
procedures to permit cross validation and the output of the Mallows
Cp statistic.



A Comparison of the Mallows Cp and
Principal Component Regression Criteria for
Best Model Selection in Multiple Regression



Multiple regression permits model testing wherein a set of
independent variables is hypothesized to predict a dependent
variable. Oftentimes, when the set of variables selected does not
significantly predict, the researcher searches for a "subset" of
variables that provides the best prediction model. The statistical
packages provide several stepwise methods for this purpose.

A review of the literature, however, indicated that most
researchers misuse stepwise methods in determining the best
predictor set or interpreting the importance of predictor variables
(Huberty, 1989; Snyder, 1991; Thompson, 1989; Thompson et al.,
1991; Welge, 1990). Tracz, Brown, and Kopriva (1991) summarized
much of the literature to indicate that the results of stepwise
procedures do not yield a "best" equation because different
criteria can be used in the selection of different sets of
variables; that when variables are intercorrelated, there is no
satisfactory way to determine the relative contribution of the
variables to R-squared because various subsets of variables could
yield a similar R-squared value; that stepwise methods inflate Type
I error rates by not using the correct degrees of freedom in
calculating the change in R-squared; and that the order of variable
entry is incorrectly interpreted as defining the importance of the
variable or "best set" of predictors.

Current research literature indicates that the all possible
subset approach is preferred over the stepwise methods in
determining the best model (Berk, 1977; Cummings, 1982; Thayer,
1986; Davidson, 1988; Henderson & Denison, 1989; Welge, 1990;
Thayer, 1990; Tracz, Brown, & Kopriva, 1991). Several criteria,
however, are available for selecting the best subset model:
R-squared, adjusted R-squared, MSE, Cp, or the principal component
regression method. Constas and Francis (1992) presented a graphical
method for selecting the best subset regression model using
R-squared and adjusted R-squared. They plotted R-squared and
adjusted R-squared against the number of predictors in the model.
The maximum number of predictors for the best subset model was
determined at the point where the R-squared and/or adjusted
R-squared values descended.
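The graphical rule just described can be sketched in a few lines of
Python. This is an illustration only, not part of the original SAS
analysis: the function name, the sample size, and the R-squared
values are hypothetical, and the usual adjusted R-squared formula
with an intercept is assumed.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors plus an intercept."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical best-subset R-squared values by number of predictors.
n = 50
best_r2 = {1: 0.20, 2: 0.30, 3: 0.34, 4: 0.35, 5: 0.351}

adj = {p: adjusted_r2(r2, n, p) for p, r2 in best_r2.items()}

# R-squared never decreases as predictors are added, but adjusted
# R-squared does; the subset size just before it descends is the peak.
peak_size = max(adj, key=adj.get)

for p in sorted(adj):
    print(p, round(best_r2[p], 3), round(adj[p], 3))
print("adjusted R-squared peaks at", peak_size, "predictors")
```

With these illustrative values R-squared keeps rising while adjusted
R-squared turns down after three predictors, which is the point the
graphical method looks for.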

The Mallows Cp criterion has also been recommended for
selecting the best subset of predictor variables in contrast to the
stepwise methods using a sample data set (Tracz, Brown, & Kopriva,
1991; Zuccaro, 1992). The Cp statistic measures the effect of
underfitting (important predictors left out of the model) or
overfitting (including predictors that make no contribution or are
marginal). Mallows (1966; 1973) has suggested that the selection
of the best subset model with the lowest bias is indicated by the
smallest Mallows Cp value, especially in the presence of
multicollinearity. The SAS package (Freund & Littell, 1991)
currently prints the Mallows Cp value and a variance inflation
factor (VIF) which can be used to determine which variables may be
involved in the multicollinearity. Pohlmann (1983) had previously
noted that multicollinearity among a set of predictor variables
did not affect the Type I error rate, but did affect the Type II
error rate and width of the confidence interval. His findings
suggest that sample size and model validity could compensate for
multicollinearity effects, especially when certain research
questions require models with highly correlated predictors, e.g.
Y = β1X1 + β2X2 + e.
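As a small illustration of the variance inflation factor mentioned
above, the following Python sketch uses the standard definition
VIF = 1 / (1 - R-squared of predictor j regressed on the other
predictors). It is not taken from the SAS output in this paper; the
function name and the correlations are hypothetical. With only two
predictors, that R-squared is simply their squared correlation.

```python
def vif(r2_j):
    """Variance inflation factor for predictor j, given the R-squared
    from regressing predictor j on the remaining predictors."""
    if not 0.0 <= r2_j < 1.0:
        raise ValueError("r2_j must be in [0, 1)")
    return 1.0 / (1.0 - r2_j)


# Two-predictor case: r2_j is the squared correlation between X1 and X2.
for r in (0.0, 0.5, 0.9):
    print("r =", r, "VIF =", round(vif(r ** 2), 2))
```

The factor equals 1 when the predictors are uncorrelated and grows
without bound as the correlation approaches 1.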

The principal component regression (PCR) approach has also
been proposed as a criterion for selecting the best predictor model.
This method appears to be useful when predicting values in one
sample based upon estimates from another sample and when
multicollinearity exists among a set of variables (Morrison, 1976).
The rationale for using a PCR approach applies when the mean squared
error of a biased estimate is smaller than the variance of an
unbiased estimate. The PCR method, however, is not appropriate for
multiple regression subset models containing interactions (Aiken &
West, 1993) nor when models depict nonlinear correlated predictor
sets. The PCR method creates a set of new variables called
principal components, which are uncorrelated or orthogonal, and
this precludes it from being used in these types of models.

Summary

The all possible subset approach is being recommended as an
alternative to stepwise methods for selecting the best set of
predictor variables. The Mallows Cp criterion or a principal
components regression approach is being advocated for determining
the best subset model over the use of R-squared, especially when the
predictors are correlated. The principal component regression
method, which determines the best model for prediction by creating
orthogonal variables, appears more useful when estimates from one
sample are used to predict in another sample or multicollinearity
exists among the predictors.

How do these criteria compare when selecting the best subset
model? When might a researcher choose one criterion over another
for selecting the best model? A comparison of the Mallows Cp
selection criterion upon cross validation and a comparison of the
parameter estimates and standard errors between the multiple
regression and the PCR approach should shed further light on their
usefulness for subset model selection. An applied example will
further elaborate the comparison of the two criteria.

METHODS AND PROCEDURES

Simulation

A SAS program generated a heuristic population (n = 10,000
observations) with a dependent variable and ten correlated
predictor variables (Appendix). The program then randomly sampled
the population data set for n = 200 observations. This data set
was then randomly divided to create two separate data sets of equal
size (n1 = n2 = 100 observations).
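The sampling design can be sketched in Python as follows. This is an
illustrative stand-in, not the SAS program in the Appendix: it draws
a simple random sample of record numbers with the standard library,
whereas the Appendix program selects records sequentially with
RANUNI. The seed and variable names are hypothetical.

```python
import random

rng = random.Random(12345)  # hypothetical seed, for repeatability only

# Record numbers 1..10000 stand in for the generated population.
population_ids = list(range(1, 10001))

# Draw 200 records without replacement, then split them at random
# into two halves of 100.
sample_ids = rng.sample(population_ids, 200)
rng.shuffle(sample_ids)
estimation_ids = sorted(sample_ids[:100])
validation_ids = sorted(sample_ids[100:])

print(len(estimation_ids), len(validation_ids))
```

The two halves are disjoint by construction, so no observation used
to estimate parameters is reused for cross validation.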

The population correlation matrix, variable means and standard
deviations are in Table 1. The correlation matrix and variable
means and standard deviations for the sample data set used to
compute the parameter estimates are in Table 2. The correlation
matrix and variable means and standard deviations for the cross
validation data set are in Table 3. Parameter estimates, computed
using the ordinary least squares criterion from the first data set,
were used with the second data set to calculate R-squared and the
Mallows Cp values, and to determine the best variable subset models.



Insert Tables 1, 2, and 3 Here



Table 4a indicates the model subset selection for each sample
data set. Table 4b compares the R-squared and Mallows Cp values
from the estimation sample data set with those from the cross
validation sample data set using parameter estimates from the
estimation sample. The Mallows Cp values were inflated because the
parameter estimates applied to the second data set altered the
residual sums of squares used in the formula to calculate it.
Although the relative ordering of Cp values was the same, these
values did not indicate the same single best variable subset model
in the second data set.
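Why applying estimates from one sample to another enlarges the
residual sum of squares can be seen in a one-predictor Python sketch.
The data values and function names are hypothetical and the example
is far simpler than the ten-predictor simulation: it only shows that
least squares coefficients fitted elsewhere cannot give a smaller
residual sum of squares than coefficients fitted to the sample itself.

```python
def fit_line(x, y):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def sse(x, y, intercept, slope):
    """Residual sum of squares for the given coefficients."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


# Hypothetical estimation and cross validation samples.
est_x = [1, 2, 3, 4, 5, 6]
est_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
val_x = [1, 2, 3, 4, 5, 6]
val_y = [2.5, 3.0, 6.9, 7.1, 11.0, 11.4]

b_est = fit_line(est_x, est_y)   # estimates from the first sample
b_val = fit_line(val_x, val_y)   # estimates from the second sample

sse_cross = sse(val_x, val_y, *b_est)  # first-sample estimates, second sample
sse_own = sse(val_x, val_y, *b_val)    # second sample's own least squares fit

print(sse_cross, sse_own)
```

Because the Cp formula divides the subset residual sum of squares by
a mean square error, a larger numerator of this kind pushes Cp up.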



Insert Tables 4a and 4b Here



Table 5 compares the parameter estimates between the Mallows
Cp and the principal components regression method for each best
variable subset model. The R-squared values will be the same
regardless of which method is used; the real difference is seen when
comparing the relative significance of the parameter estimates. The
Mallows Cp method with correlated predictors indicated that all the
parameter estimates were significant. This was not the case in the
principal components regression approach. An applied example will
further illustrate this distinction between the two methods.



Insert Table 5 Here

APPLIED EXAMPLE

Subjects

Participants in the study were a cohort of students accepted 
into the Texas Academy of Mathematics and Science (TAMS) at the 
University of North Texas in Fall, 1993. TAMS is an early college 
entrance program in which students earn approximately 60 hours of 
college credit by taking University of North Texas courses. 
Students enter TAMS at the beginning of their 11th year in high
school. They live on campus in a special residence hall and take 
regular university courses in mathematics, science and the 
humanities. After two years, participants receive a special high 
school diploma and have amassed at least 60 hours of college 
credit. Each year approximately 200 high school sophomores, who 
have met the selection criteria and completed the 10th grade, are 
accepted into the Texas Academy of Mathematics and Science. 

In the study year, TAMS accepted 204 students. Of these, 156
students attended an August orientation, which occurred a week
prior to their first semester of college coursework, and completed
the LASSI. There were 80 females and 76 males who participated in
the study. The students who took the LASSI were similar in
demographic background and academic ability to previous classes
because of the academy's consistent admission requirements and pool
of applicants. The participants' SAT-M and SAT-V means and standard
deviations, respectively, were: M = 651, SD = 57; and M = 530,
SD = 75.

Instrument



The LASSI is an English language assessment tool designed to 
measure college students' use of learning and study strategies. It 
was designed to provide assessment and pre-post achievement 
measures for students participating in a learning strategies and 
study skills project. A high-school version is available, but it 
was not recommended for use with accelerated students in these 
programs (Eldredge, 1990). The LASSI can be administered in a 
group setting in approximately 30 minutes. The carbonless test 
format allows participants to score their own assessment and take 
a copy of the results with them from the testing session. 

The LASSI's ten subscales focus on thoughts and behaviors
related to successful learning. The ten subscales are (1)
attitude; (2) motivation; (3) time management; (4) anxiety; (5)
concentration; (6) information processing; (7) selecting the main
ideas; (8) study aids; (9) self-testing; and (10) test strategies
(for more details, see Weinstein, 1987). Reliability studies
reported Cronbach alpha internal consistency values ranging from
.70 to .86 and test-retest reliabilities from .70 to .85. Validity
studies have also reported normative data for high school and
college students with different instruments for each group
(Weinstein, Palmer, & Schulte, 1987). Students respond to
individual items on each subscale using a five-point scale: (5)
very typical of me; (4) fairly typical of me; (3) somewhat typical
of me; (2) not very typical of me; and (1) not at all typical of
me. Some item values are reverse keyed before being added to
obtain a subscale score. The subscale scores are compared by
graphing them onto a normal curve equivalent percentile chart.
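The scoring step can be sketched in Python. This is an illustration
under stated assumptions, not the LASSI scoring key: it assumes that
reverse keying on the five-point scale means replacing a response r
with 6 - r, and the choice of which item is reverse keyed below is
hypothetical.

```python
def subscale_score(responses, reverse_keyed):
    """Sum five-point item responses, reversing the listed items.

    responses: dict mapping item name to a response from 1 to 5.
    reverse_keyed: set of item names scored as 6 - response (assumed rule).
    """
    total = 0
    for item, value in responses.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{item}: response must be between 1 and 5")
        total += (6 - value) if item in reverse_keyed else value
    return total


# Hypothetical responses to three items; Q9 is treated as reverse keyed.
responses = {"Q1": 5, "Q9": 2, "Q25": 4}
print(subscale_score(responses, reverse_keyed={"Q9"}))
```

Reversing an item this way keeps every item on the same 1 to 5 range,
so a high subscale total always points in the same direction.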

According to the LASSI user's manual (Weinstein, 1987),
students scoring above the 75th percentile do not need to improve
that specific skill or strategy. Students scoring between the 75th
percentile and the 50th percentile should consider improvement.
Students scoring below the 50th percentile on a subscale need
assistance to improve that skill or strategy. For example,
students scoring below the 50th percentile on the anxiety subscale
would be considered anxious about being in college. Likewise,
students scoring below the 50th percentile on the motivation
subscale lack appropriate motivation to do college level work
effectively.

Research Question 

The research question of interest was whether the ten LASSI 
subscales could predict a student's college grade point average 
after one semester of college coursework. A related question 
pertained to whether a "subset" of the ten LASSI subscales could 
better predict college grade point average for this sample of 
students. Students not maintaining at least a 2.50 grade point 
average after one semester of college coursework were dismissed
from the Academy. Knowledge of which subscales are best predictors
of college grade point average would aid staff in identifying
potential at-risk students upon entering the Academy.

Data Analysis

The data were analyzed using a SAS statistical program
(Appendix). The student's college grade point average was
predicted by the ten LASSI subscales using PROC REG with the
SELECTION option requesting the best subset model criteria. The
PROC PRINCOMP procedure was used to create ten orthogonal principal
component variables. The principal component variable parameter
estimates were then computed using the PROC REG procedure. The
number of significant principal component parameter estimates was
subsequently identified. These procedures are outlined in the SAS
System for Regression manual (Freund & Littell, 1991).

RESULTS 

The correlation matrix, means and standard deviations of the
ten LASSI subscales are in Table 6. The intercorrelations among
the subscales indicated that Anxiety/Worry was not significantly
correlated with Time Management, Information Processing, Support
Techniques/Materials, and Self Testing/Class Preparation. The
lowest subscale mean was on Selecting Main Ideas.

Insert Table 6 Here

Mallows Cp 

The Mallows Cp statistic is calculated as: Cp = (SSEp / MSE) -
(n - 2p) + 1 (Freund & Littell, 1991) or Cp = (1/σ^2)(RSSp) - n + 2p
(Mallows, 1973); where RSSp (or SSEp) is the residual sum of squares
from the best variable subset model, MSE and/or σ^2 is the mean
square error from the full model with all predictor variables,
n = sample size, and p = number of predictors.
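The two expressions above can be written as small Python functions
for checking hand calculations. This is a sketch, not the PROC REG or
PROC IML computation: the function names and input values are
hypothetical, and the comments on how each source counts p are an
assumption of this sketch rather than something stated in the text.

```python
def cp_freund_littell(sse_p, mse_full, n, p):
    """Cp = (SSEp / MSE) - (n - 2p) + 1, as printed in the text.
    Assumed here: p counts predictors, excluding the intercept."""
    return sse_p / mse_full - (n - 2 * p) + 1


def cp_mallows(rss_p, sigma2, n, p):
    """Cp = (1 / sigma^2) * RSSp - n + 2p, as printed in the text.
    Assumed here: p counts fitted parameters."""
    return rss_p / sigma2 - n + 2 * p


# Hypothetical full model: n = 100 cases, 10 predictors, SSE = 178.
n = 100
m = 10
sse_full = 178.0
mse_full = sse_full / (n - m - 1)   # 178 / 89 = 2.0

print(cp_freund_littell(sse_full, mse_full, n, m))
print(cp_mallows(sse_full, mse_full, n, m + 1))
```

For a subset model, only the residual sum of squares and p change;
the mean square error in the denominator stays that of the full model.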

The procedure for finding the optimum subset of all possible
subset sizes requires computing 2^m equations, where m is the number
of predictors in the full model. The ten subscale predictors in the
model yielded 1,024 regression equations (2^10) with associated
selection criteria statistics. (Note: the number of subset equations
generated for p predictor variables from an m variable full model
is m! / [p! (m - p)!]. For example, the number of 2 variable subset
equations generated from a 10 variable model would be 45.) Only the
single best variable subset models of each size are reported.
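The counts quoted above can be verified with a short Python check.
This is only arithmetic on the formula in the note; the variable
names are illustrative.

```python
from math import comb

m = 10  # predictors in the full model

# Number of p-variable subsets for each size p = 0, 1, ..., m.
subsets_by_size = {p: comb(m, p) for p in range(m + 1)}

# Summing over all sizes (including the model with no predictors)
# gives 2^m.
total_equations = sum(subsets_by_size.values())

print(subsets_by_size[2])   # two variable subsets from ten predictors
print(total_equations)      # all possible subsets
```

The total of 2^10 includes the model with no predictors, so the number
of models containing at least one predictor is one fewer.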

The best subset model for each subset size with the
corresponding criteria are in Table 7. The Mallows Cp of 2.72
indicated a four variable subset model. The four variable subset
model for predicting college grade point average consisted of the
four subscales: Motivation (2), Anxiety/Worry (4), Support
Techniques/Materials (8), and Self Testing/Class Preparation (9).

The Cp criterion also indicated the overfitting caused by
having too many variables in the model. The large Cp values
indicated equations with larger mean square error. If Cp > (p +
1), for any subset size p, then bias was present. If Cp < (p +
1), for any subset size p, then the model contained too many
variables. A plot of the Cp values against the number of
predictors, compared to a plot of the (p + 1) values, visually
displays this phenomenon (Figure 1).
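The decision rule in this paragraph can be written out as a short
Python sketch. Only the four variable value of 2.72 comes from the
text; the other Cp values are hypothetical placeholders, since Table 7
is not reproduced here, and the function name is illustrative.

```python
def describe_cp(cp, p):
    """Apply the rule stated above: compare Cp with p + 1."""
    if cp > p + 1:
        return "bias present"
    if cp < p + 1:
        return "too many variables"
    return "Cp equals p + 1"


# Best-subset Cp by subset size. Only the entry for p = 4 is from the
# text; the remaining values are hypothetical.
cp_by_size = {1: 14.8, 2: 8.1, 3: 4.6, 4: 2.72, 5: 4.1}

smallest_cp_size = min(cp_by_size, key=cp_by_size.get)

for p in sorted(cp_by_size):
    print(p, cp_by_size[p], p + 1, describe_cp(cp_by_size[p], p))
print("smallest Cp at subset size", smallest_cp_size)
```

Listing Cp next to p + 1 in this way gives the same comparison that
the overlay plot in Figure 1 shows graphically.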



Insert Table 7 and Figure 1 Here 



The present pattern of Cp values for the various subsets of
size p is typical when multicollinearity is present. The Cp
values initially become smaller, but then start to increase. The
plot of Cp values is similar to a "scree" plot in factor analysis
and as such a multiple regression method might also be useful in
determining the number of variables to retain (Zoski & Jurs,
1993). The best subset model is indicated when the Cp values
begin to increase and cross the (p + 1) values (Figure 1).



Principal Components Regression 



Principal components are obtained by computing eigenvalues
from the correlation matrix. The correlation matrix is used so
that variables are not affected by the scale of measurement as in
the use of a variance-covariance matrix. Since eigenvalues are
the variances of the principal component variables, the sum of
the eigenvalues equals the number of variables in the full model,
just as the sum of standardized variable variances would equal
the number of variables. This sum is the measure of the total
variation in the data set. A wide variation in the eigenvalues
would suggest the presence of multicollinearity among the
variables. The number of eigenvalues greater than unity, as in
factor analysis, would indicate the number of variables from the
full model that would explain most of the variance in the data
set. The eigenvectors, in contrast, contain the coefficients for
each principal component variable. These coefficients are applied
to the observed values of the original variables to create the
principal component scores. These scores are then used in multiple
regression as orthogonal predictor values with no multicollinearity
present.
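The properties described in this paragraph can be checked by hand in
the two-variable case, where the correlation matrix has eigenvalues
1 + r and 1 - r with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
The Python sketch below uses hypothetical data and is not the PROC
PRINCOMP computation used in the paper.

```python
import math
from statistics import mean, stdev, variance


def standardize(values):
    """Convert values to z scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical scores on two correlated predictors.
x1 = [2.0, 4.0, 5.0, 7.0, 9.0, 10.0]
x2 = [1.0, 3.0, 6.0, 6.0, 8.0, 11.0]

z1, z2 = standardize(x1), standardize(x2)
n = len(z1)
r = sum(a * b for a, b in zip(z1, z2)) / (n - 1)   # correlation

# Eigenvalues of the 2 x 2 correlation matrix [[1, r], [r, 1]].
eigenvalues = [1 + r, 1 - r]

# Component scores: eigenvector coefficients applied to the z scores.
w = 1 / math.sqrt(2)
pc1 = [w * (a + b) for a, b in zip(z1, z2)]
pc2 = [w * (a - b) for a, b in zip(z1, z2)]

# Covariance between the two components (both have mean zero).
cov_pc = sum(a * b for a, b in zip(pc1, pc2)) / (n - 1)

print("eigenvalues:", eigenvalues, "sum:", sum(eigenvalues))
print("component variances:", variance(pc1), variance(pc2))
print("component covariance:", cov_pc)
```

The eigenvalues sum to the number of variables, each component's
variance equals its eigenvalue, and the components are uncorrelated,
which is what removes multicollinearity from the regression.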

Preliminary inspection of the model components (Type II SS)
in Table 8 indicated three principal component variables (1, 4,
and 8) that accounted for 69% of the variance in predicting
college grade point average (7.42/10.76). The first model
component alone explained 39% of the variance (4.16/10.76).
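The two percentages follow directly from the ratios given in
parentheses, as this short Python check shows. The variable names are
illustrative labels for the quoted numbers, not names used in the SAS
output.

```python
# Sums of squares quoted in the text.
denominator = 10.76        # denominator used in both ratios
three_components = 7.42    # components 1, 4, and 8 together
first_component = 4.16     # component 1 alone

pct_three = round(100 * three_components / denominator)
pct_first = round(100 * first_component / denominator)

print(pct_three, pct_first)
```

Because the principal components are orthogonal, their Type II sums of
squares can be added in this way without double counting.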

A comparison of the full model parameter estimates in Table
9 between the original correlated predictors and the principal
component regression variables gives better insight into the best
variable subset model selection criteria. The multiple
regression analysis with correlated predictors identified
motivation (2) and support (8), while the principal component
method identified attention (1), anxiety/worry (4), and support
(8).



Insert Tables 8 and 9 Here

SUMMARY

The Cp criterion identified a four variable predictor model
as best: motivation (2), anxiety/worry (4), support (8), and
class preparation (9). This four variable subset model was
further verified by examining where the plot of Cp values against
the (p + 1) values crossed. The Cp criterion selected the
smallest variable subset model in the presence of variable
multicollinearity. The principal components approach identified
attention (1), anxiety/worry (4), and support (8). In examining
the parameter estimates in the multiple regression analysis, only
motivation (2) and support (8) were significant relative to the
other predictors in the model. The Mallows Cp and PCR criteria
indicated slightly different sets of predictor variables
depending upon whether the independent variables were correlated.

In using multiple regression it is important to have a
theoretical basis for the regression model and to consider model
validation. A common misconception in multiple regression is
that the model with all the significant predictors included is
the best model. This is not always the case. The problem is that
the b's and R-squared values are data dependent due to the least
squares criterion being applied to a specific sample of data. A
different sample will usually result in different parameter
estimates and variance explained. Although the standard errors
of the b's do provide the researcher with some indication of the
amount of change expected from sample to sample, the fact remains
that the estimates obtained from one sample may predict poorly
when applied to a new set of sample data. The primary method to
assess any change in estimates is to replicate the regression
model using other sample data. The Mallows Cp criterion was
similarly suspect because values were inflated upon cross
validation and the best variable subset model in one sample was
not identified in the other sample. Obviously, if the mean
square error estimates and the residual sums of squares
fluctuate, then model selection will be erroneous (see Mallows Cp
formula).

The rationale behind a regression model is to estimate σ^2
(the true model's mean square error variance). Since σ^2 is not
generally known, a researcher must estimate it from a knowledge
of prior research, obtain estimates from a model
containing all theoretically relevant predictors, replicate the
study, or use bootstrapping, jackknifing, and cross-validation
methods. In this regard, effect size considerations, as
recommended by Thompson et al. (1991), become important to
consider in evaluating a regression model.

Table 5. Mallows Cp and principal components regression comparison
(sample 1, n = 100).

[Table body not recoverable. Column headings: Best Variable Subset
Model; Multiple Regression (Mallows Cp); Principal Components;
R-squared.]

Note: Regression parameters have been rounded to 2 decimal
places unless otherwise noted. The t value = β / SE.

Table 6. LASSI subscale intercorrelations, means and standard
deviations (n = 156).

[Table body not recoverable. Row heading: LASSI Subscale.]

Note: The values have been rounded to the nearest hundredth.

Figure 1. Overlay plot of Cp and (p + 1) values.

APPENDIX



POPULATION, SAMPLE, PROC REG and PROC IML PROGRAM

* create population data set with y and 10 x variables;
* each x adds y to its own random term, so the x variables are;
* correlated with y and with one another;

data random;
  drop n;
  do n = 1 to 10000;
    y   = 10 + sqrt(4)  * normal(473123);
    x1  =  8 + sqrt(16) * normal(897245) + y;
    x2  =  6 + sqrt(64) * normal(987214) + y;
    x3  =  9 + sqrt(32) * normal(123935) + y;
    x4  = 12 + sqrt(18) * normal(839857) + y;
    x5  = 18 + sqrt(20) * normal(897245) + y;
    x6  = 16 + sqrt(40) * normal(987214) + y;
    x7  = 29 + sqrt(70) * normal(123935) + y;
    x8  = 32 + sqrt(13) * normal(839857) + y;
    x9  = 24 + sqrt(45) * normal(123935) + y;
    x10 =  2 + sqrt(62) * normal(839857) + y;
    id = n;
    output;
  end;

* population correlation matrix and regression analysis;

proc corr; var y x1-x10;
proc reg outest=est; model y = x1-x10 /
     selection=rsquare cp best=1;
proc plot; plot _cp_ * _in_ = 'C' _p_ * _in_ = '*' / overlay;

* randomly sample 200 subjects from population data set;
* k counts up from 200, so the step stops after 200 records are output;

data sample1;
  retain k 200 n;
  if _n_ = 1 then n = total;
  int = 1;
  set random nobs=total;
  if ranuni(4740080) <= k/n then
    do;
      output;
      k = k + 1;
    end;
  n = n + 1;
  if k = 400 then stop;
  id = _n_;
  drop k n;

* create two randomly sampled data sets of size 100;
* repeat correlation and regression programs for each;
* randomly select 100 subjects for estimation data set;

data sample2;
  retain k 100 n;
  set sample1 nobs=total;
  if _n_ = 1 then n = total;
  select = 0;
  if ranuni(716549) <= k/n then
    do;
      select = 1;
      output;
      k = k + 1;
    end;
  n = n + 1;
  if k = 200 then stop;
proc corr data=sample2; var y x1-x10;
proc reg data=sample2 outest=est1; model y = x1-x10 /
     selection=rsquare cp best=1;
proc plot; plot _cp_ * _in_ = 'c' _p_ * _in_ = '*' / overlay;

* select remaining 100 subjects for cross validation data set;

data sample3;
  merge sample1 sample2; by id;
  if select = 1 then delete;
proc corr data=sample3; var y x1-x10;
proc reg data=sample3 outest=est2; model y = x1-x10 /
     selection=rsquare cp best=1;
proc plot; plot _cp_ * _in_ = 'c' _p_ * _in_ = '*' / overlay;

proc iml;

* This is a full model using the 10 variable estimation sample;
use sample2;
read all var {y} into y;
read all var {int x1 x2 x3 x4 x5 x6 x7 x8 x9 x10} into x;
n = nrow(x);                        /* number of observations    */
k = ncol(x);                        /* number of variables       */
xpx = x` * x;                       /* cross-products            */
xpy = x` * y;
ypy = y` * y;
xpxi = inv(xpx);                    /* inverse crossproducts     */
b = xpxi * xpy;                     /* beta weights              */
c = inv(xpx);
bhat = c * x` * y;
sse = y` * y - bhat` * x` * y;      /* sum of squares error      */
ybar = sum(y) / n;
yhat = x * b;                       /* predicted y values        */
ssr = ssq(yhat - ybar);             /* sum of squares regression */
sst = sse + ssr;                    /* sum of squares total      */
dfe = n - k;                        /* degrees of freedom error  */
mse102 = sse / dfe;                 /* mean squared error        */
r2t = ssr / sst;                    /* rsquare                   */
r2p = r2t;
cp = ((1 / mse102) * sse) - (n - 2*k);   /* mallows cp           */
print "This is the Full model for estimation sample data" r2p cp;

* Re-estimate using cross validation sample data;
* b is not recomputed here, so yhat uses the estimation sample weights;
use sample3;
read all var {y} into y;
read all var {int x1 x2 x3 x4 x5 x6 x7 x8 x9 x10} into x;
n = nrow(x);                        /* number of observations    */
k = ncol(x);                        /* number of variables       */
xpx = x` * x;                       /* cross-products            */
xpy = x` * y;
ypy = y` * y;
xpxi = inv(xpx);                    /* inverse crossproducts     */
c = inv(xpx);
bhat = c * x` * y;                  /* parameter estimates       */
sse = y` * y - bhat` * x` * y;      /* sum of squares error      */
ybar = sum(y) / n;
yhat = x * b;                       /* predicted y values        */
ssr = ssq(yhat - ybar);             /* sum of squares regression */
sst = sse + ssr;                    /* sum of squares total      */
dfe = n - k;                        /* degrees of freedom error  */
mse103 = sse / dfe;                 /* mean squared error        */
r2t = ssr / sst;                    /* rsquared                  */
r2p = r2t;
cp = ((1 / mse103) * sse) - (n - 2*k);   /* mallows cp           */
print "This is the Full model for cross validation sample data" r2p cp;
* This is a 9 variable estimation model;
use sample2;
read all var {y} into y;
read all var {int x1 x2 x3 x4 x5 x6 x8 x9 x10} into x;
n = nrow(x);                        /* number of observations    */
k = ncol(x);                        /* number of variables       */
t = k;
xpx = x` * x;                       /* cross-products            */
xpy = x` * y;
ypy = y` * y;
xpxi = inv(xpx);                    /* inverse crossproducts     */
b = xpxi * xpy;
c = inv(xpx);
bhat = c * x` * y;                  /* parameter estimates       */
sse = y` * y - bhat` * x` * y;      /* sum of squared errors     */
ybar = sum(y) / n;
yhat = x * b;                       /* predicted y values        */
ssr = ssq(yhat - ybar);             /* sum of squares regression */
sst = sse + ssr;
r2p = ssr / sst;                    /* rsquared                  */
cp = ((1 / mse102) * sse) - (n - 2*k);   /* mallows cp           */
print "This is for" k " estimation model sample data" r2p cp;

* Re-estimate using cross validation sample data;
* b is not recomputed here, so yhat uses the estimation sample weights;
use sample3;
read all var {y} into y;
read all var {int x1 x2 x3 x4 x5 x6 x8 x9 x10} into x;
n = nrow(x);                        /* number of observations    */
k = ncol(x);                        /* number of variables       */
t = k;
xpx = x` * x;                       /* cross-products            */
xpy = x` * y;
ypy = y` * y;
xpxi = inv(xpx);                    /* inverse crossproducts     */
c = inv(xpx);
bhat = c * x` * y;                  /* parameter estimates       */
sse = y` * y - bhat` * x` * y;      /* sum of squares error      */
ybar = sum(y) / n;
yhat = x * b;
ssr = ssq(yhat - ybar);
sst = sse + ssr;
r2p = ssr / sst;
cp = ((1 / mse103) * sse) - (n - 2*k);
print "This is for" k " cross validation model sample data" r2p cp;

Repeat the above two steps of SAS code for the 9 variable subset
model, replacing the read all var statement for x with the remaining
8 variable subset model; then repeat for the 7 variable subset model,
and so on down to the 1 variable subset model.

For example:

* This is an 8 variable estimation model;
read all var {int x1 x2 x3 x5 x6 x8 x9 x10} into x;
* Re-estimate using cross validation sample data;
read all var {int x1 x2 x3 x5 x6 x8 x9 x10} into x;

SAS STATISTICAL PROGRAM

Applied Example

DATA LASSI; INFILE 'A:\LASSI.DAT';
  INPUT SEX 27 SATM 31-33 SATV 35-37 STATUS 55 CGPA 67-71
        #2 (Q1-Q77) (77*1.0) #3;
  * subsetting IF placed after INPUT so that STATUS has been read;
  IF STATUS=1;
  LABEL CGPA = 'FIRST SEMESTER COLLEGE GPA';
  * STATUS: 1 = CURRENT STUDENT, 2 = WITHDREW;
  * SEX: 1 = FEMALE, 2 = MALE;

  * subscale scores are sums of their item responses;
  PREATT = Q5  + Q14 + Q18 + Q29 + Q38 + Q45 + Q51 + Q69;
  PREMOT = Q10 + Q13 + Q16 + Q28 + Q33 + Q41 + Q49 + Q56;
  PRETMT = Q3  + Q22 + Q36 + Q42 + Q48 + Q58 + Q66 + Q74;
  PREANX = Q1  + Q9  + Q25 + Q31 + Q35 + Q54 + Q57 + Q63;
  PRECON = Q6  + Q11 + Q39 + Q43 + Q46 + Q55 + Q61 + Q68;
  PREINP = Q12 + Q15 + Q23 + Q32 + Q40 + Q47 + Q67 + Q76;
  PRESMI = Q2  + Q8  + Q60 + Q72 + Q77;
  PRESTA = Q7  + Q19 + Q24 + Q44 + Q50 + Q53 + Q62 + Q73;
  PRESFT = Q4  + Q17 + Q21 + Q26 + Q30 + Q37 + Q65 + Q70;
  PRETST = Q20 + Q27 + Q34 + Q52 + Q59 + Q64 + Q71 + Q75;

  LABEL PREATT = 'ATTITUDE AND INTEREST'
        PREMOT = 'MOTIVATION'
        PRETMT = 'TIME MANAGEMENT'
        PREANX = 'ANXIETY AND WORRY'
        PRECON = 'CONCENTRATION AND ATTENTION'
        PREINP = 'INFORMATION PROCESSING'
        PRESMI = 'SELECT MAIN IDEAS'
        PRESTA = 'SUPPORT TECHNIQUES AND MATERIALS'
        PRESFT = 'SELF TESTING AND CLASS PREPARATION'
        PRETST = 'TEST STRATEGIES';

PROC REG; MODEL CGPA = PREATT--PRETST;

PROC REG OUTEST=EST; MODEL CGPA = PREATT--PRETST /
     SELECTION = RSQUARE CP BEST=1;

PROC PLOT; PLOT _CP_ * _IN_ = 'C' _P_ * _IN_ = '*' / OVERLAY
     VAXIS = 0 TO 12 BY 1 HAXIS = 0 TO 10 BY 1;

PROC PRINCOMP DATA=LASSI OUT=PRIN; VAR PREATT--PRETST;

PROC REG; MODEL CGPA = PRIN1-PRIN10 / SS2;

RUN;